Abstract

Many large arithmetic computations rely on tables of all primes less than $n$. For example,
the fastest algorithms for computing $n!$ take time $O(M(n \log n) + P(n))$, where $M(n)$ is the
time to multiply two $n$-bit numbers, and $P(n)$ is the time to compute a prime table up to $n$.
The fastest algorithm to compute $\binom{n}{n/2}$ also uses a prime table. We show that it takes time
$O(M(n) + P(n))$.

In various models, the best bound on $P(n)$ is greater than $M(n \log n)$, given advances in
the complexity of multiplication. In this paper, we give two algorithms for computing
prime tables and analyze their complexity on a multitape Turing machine, one of the standard
models for analyzing such algorithms. These two algorithms run in time $O(M(n \log n))$ and
$O(n \log^2 n / \log\log n)$, respectively. We achieve our results by speeding up Atkin’s sieve.

Given that the current best bound on $M(n)$ is $n \log n \, 2^{O(\log^* n)}$, the second algorithm is faster
and improves on the previous best algorithm by a factor of $\log^2 \log n$. Our fast prime-table
algorithms speed up both the computation of $n!$ and $\binom{n}{n/2}$.

Finally, we show that computing the factorial takes $\Omega(M(n \log^{4/7-\epsilon} n))$ for any constant
$\epsilon > 0$, assuming only multiplication is allowed.

Keywords: prime tables, factorial, multiplication, lower bound


1 Introduction 

Let $P(n)$ be the time to compute the prime table $T_n$, that is, a table of all primes from 2 to $n$. The best
bound for $P(n)$ on a log-RAM is $O(n / \log\log n)$, using the Sieve of Atkin, and $O(n \log^2 n \log\log n)$
on the multitape Turing machine (TM), a standard model for analyzing prime table computation,
factorial computation, and other large arithmetic computations [13, 25, 26]. This TM algorithm is
due to Schönhage et al. [25] and is based on the Sieve of Eratosthenes.

The main result of this paper is two algorithms that improve the time to compute $T_n$ on a
TM. One runs in $O(n \log^2 n / \log\log n)$ and thus speeds up Schönhage’s algorithm by a factor of
$\log^2 \log n$.

The other has a running time that depends on the time to multiply large numbers. Let $M(a, b)$
be the time to multiply an $a$-bit number with a $b$-bit number, and let $M(a) = M(a, a)$. We make
the standard assumption that $f(n) = M(n)/n$ is a monotone non-decreasing function. Then
we give a prime-table algorithm that runs in time $O(M(n \log n))$ on a TM. Fürer’s algorithm [13]
gives the best bound for $M(n)$ on a TM, which is $n \log n \, 2^{O(\log^* n)}$, a bound that was later achieved
by a different method by De et al. [8], so our second algorithm is currently slower than the first
algorithm.

Prime tables are used to speed up many types of computation. For example, the fastest
algorithms for computing $n!$ depend on prime tables [6, 25, 28]. Schönhage’s algorithm [25] is fastest
and takes time $O(M(n \log n) + P(n))$.

The number of bits in $n!$ is $O(n \log n)$, and Borwein [6] conjectured that computing $n!$ takes
$\Theta(M(n \log n))$ time. On the log-RAM, Fürer [13] showed that $M(n) = O(n)$. So on the
log-RAM, the upper bound of Borwein’s conjecture seems to be true, since $M(n \log n)$ dominates
$O(n / \log\log n)$ for now.

On a TM, there is a simple lower bound of $\Omega(n \log n)$ to compute $n!$, since that is the number
of TM characters needed to represent the output. This contrasts with the $O(n)$-word output on
the log-RAM. On the other hand, no $O(M(n \log n))$-time algorithm was known in this model, since
before our improved prime-table algorithms, $P(n)$ dominated $M(n \log n)$.^1 Using our $O(M(n \log n))$-time
prime-table algorithm, the time to compute $n!$ is improved to $O(M(n \log n))$. If Borwein’s
conjecture turns out to be true, this algorithm will turn out to be optimal for computing $n!$.

Another use of prime tables is in the computation of binomial coefficients. The exact complexity
of computing binomial coefficients had not been analyzed, but here we show that a popular algorithm
takes time $O(M(n) + P(n))$. Thus our faster algorithm also improves this running time by a factor of $\log^2 \log n$.

Finally, we consider lower bounds for computing $n!$. Although we do not produce a general
lower bound for computing $n!$ on a TM^2, we do show a lower bound for algorithms on the following
restricted model. We do not restrict which operations can be used, but we assume that the factorial
$n!$ is output by a multiplication. We assume that a multiplication can only operate on two integers,
each of which can be an integer of $o(n \log n)$ bits or a product computed by a multiplication. Under
this restriction, we show a lower bound

$$\Omega\left(\max\left\{ M_{t^{1/2-\epsilon}}\left(\frac{1}{t}\, n \log n\right),\ \frac{t}{w}\, n \log n \right\}\right) \quad \text{for } t \in [1, n], \qquad (1)$$

where $w$ denotes the word size in the model and $M_k(n)$ denotes the optimal time to multiply $k$ pairs
of $n$-bit integers (see Section 4). Given an upper bound and a lower bound for $M(n)$,
we can simplify the lower bound in Equation (1).

On the Turing Machine, we know that $M(n)$ has a simple linear lower bound $\Omega(n)$ and, due to
Fürer [13] and De et al. [8], we have an upper bound $M(n) = n \log n \, 2^{O(\log^* n)}$. In that case, we
have a lower bound in the multiplication model of

$\Omega(M(n \log^{4/7-\epsilon} n))$ for any constant $\epsilon > 0$.


On the log-RAM, we know that $M(n)$ has a lower bound of $\Omega(n / \log n)$ because operations on
$O(\log n)$-bit words take at least constant time. The upper bound for $M(n)$, also due to Fürer [13],
is $O(n)$. In that case, under the multiplication restriction, we have the same lower bound as on the
Turing Machine. They coincide because both models have a $\log^{1+o(1)} n$ gap between the lower and
upper bounds of $M(n)$.


Organization. In Section 2 we present the related work for computing prime tables. We propose
two algorithms in Section 3. Last, in Section 4 we show a lower bound for computing factorials.
The related work and new upper bounds for factorials and binomials can be found in the appendix,
Sections 5 and 6.

^1 We note that before Fürer’s algorithm, the opposite was true. This is because before Fürer’s algorithm, the best
bound on $M(n)$ was $O(n \log n \log\log n)$ [26].

^2 And indeed, such a result would be a much bigger deal than any upper bound!
2 Background and Related Work


In this section, we present the relevant background and related work on computing prime tables
and defer those for factorials and binomial coefficients to Section 5.

The Sieve of Eratosthenes is the standard algorithm used in the RAM model. It creates a bit table
where each prime is marked with a 1 and each composite is marked with a 0. The multiples of
each prime found so far are set to 0, each in $O(1)$ time, and thus the whole algorithm takes time
$\sum_{\text{prime } p \le n} n/p = O(n \log\log n)$. However, on a TM, each multiple of a prime cannot be marked in $O(1)$
time. Instead, marking all the multiples of a single prime takes $O(n)$ time, since the entire table
must be traversed. Since any composite number up to $n$ has some prime factor of at most $\sqrt{n}$, and
there are $O(\sqrt{n} / \log n)$ such primes, this approach takes time $O(n^{3/2} / \log n)$.
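As a point of reference, the RAM-model bit table described above can be sketched as follows. This is a minimal illustration, not the TM procedure analyzed in this paper; the function name `eratosthenes_bit_table` is a hypothetical choice, and the code assumes $n \ge 2$.

```python
def eratosthenes_bit_table(n):
    """Return a bit table where table[k] == 1 iff k is prime (assumes n >= 2)."""
    table = [1] * (n + 1)
    table[0] = table[1] = 0
    p = 2
    while p * p <= n:
        if table[p]:
            # On a RAM each of these writes costs O(1); on a TM the whole
            # table would have to be traversed for every prime p.
            for multiple in range(p * p, n + 1, p):
                table[multiple] = 0
        p += 1
    return table


primes_up_to_50 = [k for k, bit in enumerate(eratosthenes_bit_table(50)) if bit]
```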

Schönhage et al. give an algorithm to compute a prime table from 2 to $n$ in $O(n \log^2 n \log\log n)$
time [25]. Their algorithm, for each prime $p \le \sqrt{n}$, generates a sorted list^3 of the multiples of $p$,
and then merges the $O(\sqrt{n} / \log n)$ lists so generated. The total number of integers on these lists is
$O(n \log\log n)$, each integer needs to be merged $O(\log n)$ times, and each integer has $O(\log n)$ bits.
Therefore, Schönhage’s algorithm has running time $O(n \log^2 n \log\log n)$.
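The list-merging idea can be illustrated with the following sketch. It is only an approximation of the approach described above: it uses Python's `heapq.merge` (a k-way merge) in place of the pairwise tape merges, and it assumes the primes up to $\sqrt{n}$ are supplied by the caller. The name `primes_by_merging` is hypothetical.

```python
from heapq import merge


def primes_by_merging(n, small_primes):
    """List the primes in [2, n] by merging sorted lists of multiples.

    small_primes is assumed to contain every prime p with p * p <= n.
    """
    # One sorted list of proper multiples per small prime.
    lists = [range(2 * p, n + 1, p) for p in small_primes]
    composites = merge(*lists)  # sorted stream, repeats allowed
    primes = []
    current = next(composites, None)
    for k in range(2, n + 1):
        is_composite = False
        while current is not None and current == k:
            is_composite = True
            current = next(composites, None)
        if not is_composite:
            primes.append(k)
    return primes
```

For example, `primes_by_merging(50, [2, 3, 5, 7])` scans the merged stream once and keeps exactly the integers that never appear in it.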

Alternatively, one can use the AKS primality test [1] on each integer in the range from 2 to
$n$. The fastest known variant of the AKS primality test is due to Lenstra and Pomerance and
takes $\tilde{O}(\log^6 n)$ time per test on a TM. If Agrawal’s conjecture [1] is true, it takes $\tilde{O}(\log^3 n)$ time.
Whether the conjecture is true or not, it would still take $\Omega(n \log^3 n)$ time to compute a prime table.
One can use the base-2 Fermat test,

$2^n \equiv 2 \pmod{n}$,

to screen out a majority of composite numbers. This would take $O(n \log n \, M(\log n))$, which is
dominated by the AKS phase. All prime numbers and $o(n / \log n)$ composite numbers can pass the
base-2 Fermat test. Therefore, it reduces the complexity by a $\log n$ factor. In this case, it would
take a finer analysis of AKS and settling Agrawal’s conjecture to determine the exact complexity
of this algorithm. It would likely take $\tilde{O}(n \log^2 n) = \Theta(n \log^2 n \log^k \log n)$ for some $k > 0$, and this
would improve on Schönhage’s algorithm if $k < 1$.
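The base-2 Fermat screen itself is a one-line modular exponentiation. The sketch below uses Python's built-in three-argument `pow`; the function name `passes_base2_fermat` is a hypothetical choice. It shows that the test keeps every prime but also lets a few composites through, such as 341 = 11 · 31, which is why a second phase such as AKS is still needed.

```python
def passes_base2_fermat(n):
    """Return True iff 2^n is congruent to 2 modulo n."""
    return pow(2, n, n) == 2 % n


# Every prime survives the screen; so does the composite 341 = 11 * 31.
survivors = [k for k in range(2, 400) if passes_base2_fermat(k)]
```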

We show how to implement the Sieve of Atkin to achieve a running time
$\min\{O(n \log^2 n / \log\log n),\ O(M(n \log n))\}$ on the Turing Machine in Section 3.


3 Fast algorithms for Atkin’s Sieve 

In this section, we give two algorithms for implementing Atkin’s Sieve on a TM. The first runs
in time $O(n \log^2 n / \log\log n)$. The second runs in time $O(M(n \log n))$. Given the state of the art
in multiplication, the first is faster. We present both, in case a faster multiplication algorithm is
discovered.

3.1 Atkin’s Sieve in $O(n \log^2 n / \log\log n)$

We define some notions before proceeding to the proof. A squarefree integer denotes an integer
that has no divisor that is a square number other than 1. Let $N_{f(x,y)}(k) = 0$ if there are an even
number of integer pairs $(x, y)$ that have $x > 0$, $y > 0$ and $f(x, y) = k$; or 1, otherwise. Similarly, let
$N'_{f(x,y)}(k) = 0$ if there are an even number of integer pairs $(x, y)$ that have $x > y > 0$ and $f(x, y) = k$;
or 1, otherwise. The key distinction is that the latter requires that $x > y$. In [2], Atkin and
Bernstein show how to test primality based on $N$ and $N'$, as shown in Theorem 3.1.

^3 It is not the case that each list occupies a tape; otherwise, $\omega(1)$ tapes are required. To merge these lists, put half of
the lists on a tape, half on the other, merge them pairwise, output the sorted lists on another two tapes and recurse.
In this way, 4 tapes are enough.

Theorem 3.1 ([2, Theorems 6.1-6.3]) For every squarefree integer $k \in 1 + 4\mathbb{N}$, $k$ is prime iff
$N_{x^2+4y^2}(k) = 1$; for every squarefree integer $k \in 1 + 6\mathbb{N}$, $k$ is prime iff $N_{x^2+3y^2}(k) = 1$; for every
squarefree integer $k \in 11 + 12\mathbb{N}$, $k$ is prime iff $N'_{3x^2-y^2}(k) = 1$.
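The three conditions of Theorem 3.1 can be checked directly on small inputs. The following sketch counts representations by brute force, so it illustrates only the statement of the theorem and not the fast algorithms of this section. All function names are hypothetical, and `atkin_test` assumes its argument is squarefree and coprime to 6.

```python
from math import isqrt


def n_parity(a, b, k):
    """Parity of the number of pairs x > 0, y > 0 with a*x^2 + b*y^2 == k."""
    count = 0
    x = 1
    while a * x * x < k:
        rest = k - a * x * x
        if rest % b == 0:
            y = isqrt(rest // b)
            if y * y == rest // b:
                count += 1
        x += 1
    return count % 2


def n_prime_parity(k):
    """Parity of the number of pairs x > y > 0 with 3*x^2 - y^2 == k."""
    count = 0
    # x > y > 0 forces k > 2*x^2, so x <= isqrt(k) covers every candidate.
    for x in range(1, isqrt(k) + 1):
        y_squared = 3 * x * x - k
        if y_squared > 0:
            y = isqrt(y_squared)
            if y * y == y_squared and y < x:
                count += 1
    return count % 2


def atkin_test(k):
    """Apply the matching condition of Theorem 3.1 (k squarefree, gcd(k, 6) = 1)."""
    if k % 4 == 1:
        return n_parity(1, 4, k) == 1
    if k % 6 == 1:
        return n_parity(1, 3, k) == 1
    return n_prime_parity(k) == 1  # remaining case: k = 11 (mod 12)
```

For instance, 65 = 5 · 13 has the two representations $7^2 + 4 \cdot 2^2$ and $1^2 + 4 \cdot 4^2$, so its parity is 0 and `atkin_test(65)` rejects it.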

We show how to compute $N_{x^2+4y^2}(k)$ for all $k \in [1, n]$ in $O(n \log^2 n / \log\log n)$ time. First, for
each $x \in [1, \sqrt{n}]$, one can enumerate a short list $x^2 + 4 \cdot 1^2,\ x^2 + 4 \cdot 2^2,\ \ldots$ of all such values that are at most $n$. Clearly,
each short list is already sorted. Then, we merge short lists pairwise until a single sorted list is
obtained; therefore, the running time is $O(n \log^2 n)$ because there are $O(n)$ integers, each of which
has $O(\log n)$ bits and is encountered $O(\log n)$ times in the merge process.

To speed up this process by a factor of $\log\log n$, as noted in [2], Atkin and Bernstein show that the
integers on these short lists are seldom coprime to the first $\log^{1/2} n$ primes. There are $O(n / \log\log n)$
such integers in total. One can speed up this process by screening out the integers on these
short lists that are not coprime to the first $\log^{1/2} n$ primes. This filter step can be completed in
$O(n \log^{1/2} n \, M(\log n))$ time and the reduced short lists can be merged in the desired time. The
same technique can be applied to $N_{x^2+3y^2}(k)$ and $N'_{3x^2-y^2}(k)$ for all $k \in [1, n]$.

Lemma 3.2 Computing $N_{x^2+4y^2}(k)$, $N_{x^2+3y^2}(k)$ and $N'_{3x^2-y^2}(k)$ for all $k$ in $[1, n]$ takes
$O(n \log^2 n / \log\log n)$ time on the Turing Machine.

We computed the Atkin conditions but now we need to get rid of all non-squarefree numbers.
Therefore, we show that generating all non-squarefree numbers requires $O(n \log n)$ time in
Lemma 3.3. Merging these three lists followed by screening out the list of non-squarefree numbers
gives a prime table, as summarized in Theorem 3.4.

Lemma 3.3 Generating a sorted list of all non-squarefree integers in the range $[1, n]$ takes
$O(n \log n)$ time on the Turing Machine.

Proof: We first generate the sorted list $L_1$ of all non-squarefree integers that have a divisor $p^2$
for some prime $p < \log n$. We initialize an array of $n$ bits as zeros; for each prime $p < \log n$, we
sequentially scan the entire array to mark all $mp^2$ for integer $m$ by counting down a counter from
$p^2$ to 0. Note that it requires amortized $O(1)$ time to decrease the counter by 1 due to the
frequency division principle [1]. Since there are $O(\log n / \log\log n)$ such primes, the running time
of this step is $O(n \log n / \log\log n)$. We then convert the array into the sorted list $L_1$ as required,
which takes $O(n \log n)$ time.

Next, we generate a sorted list $L_2$ of all non-squarefree integers that have a divisor $p^2$ for some
prime $p \ge \log n$. We generate a sorted short list for each such prime $p$, containing all the integers
$mp^2 \le n$ for some integer $m$. Then, we merge these sorted short lists. Note that there are
$\sum_{p \ge \log n} n/p^2 = O(n / \log n)$ integers on these short lists, each integer has $O(\log n)$ bits, and each
integer is encountered $O(\log n)$ times in the merging process. The running time is thus $O(n \log n)$.
We are done by merging $L_1$ and $L_2$. ■
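The two-phase construction in the proof can be sketched as follows. This is an illustration under assumptions, not the TM procedure: the bit array and `heapq.merge` stand in for the tape scans and pairwise merges, the natural logarithm is used for the threshold, and the caller is assumed to supply all primes $p$ with $p^2 \le n$. The name `nonsquarefree_up_to` is hypothetical.

```python
from heapq import merge
from math import log


def nonsquarefree_up_to(n, primes):
    """Sorted list of the non-squarefree integers in [1, n].

    primes is assumed to contain every prime p with p * p <= n.
    """
    threshold = log(n)  # assumption: natural log; any fixed base works here
    small = [p for p in primes if p < threshold]
    large = [p for p in primes if p >= threshold]

    # Phase 1 (list L1): mark multiples of p^2 in a bit array for small p.
    marked = [0] * (n + 1)
    for p in small:
        for m in range(p * p, n + 1, p * p):
            marked[m] = 1
    list1 = [k for k in range(1, n + 1) if marked[k]]

    # Phase 2 (list L2): merge one short sorted list per large prime.
    list2 = merge(*[range(p * p, n + 1, p * p) for p in large])

    # Final merge of L1 and L2, dropping duplicates.
    result = []
    for k in merge(list1, list2):
        if not result or result[-1] != k:
            result.append(k)
    return result
```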

Theorem 3.4 The prime table $T_n$ from 2 to $n$ can be computed on the Turing Machine in time

$P(n) = O(n \log^2 n / \log\log n)$.

3.2 Atkin’s Sieve in $O(M(n \log n))$

We show that the sieve of Atkin can be realized in $O(M(n \log n))$ time on the Turing Machine. We
apply multiplication to the computation of $N_{f(x,y)}(k)$ and $N'_{f(x,y)}(k)$ for all $k \in [1, n]$. The balance
of the work will take $O(n \log n)$, and will thus be dominated by the multiplication.

An important aspect of the multiplication will be the number of bits needed in the multiplicands.
For this, we need Lemma 3.5, stating an upper bound on the number of (integer) lattice points on
the ellipses specified by the first two Atkin conditions and on the truncated hyperbola $3x^2 - y^2 = k$
for $x > y > 0$.

Lemma 3.5 The number of integer pairs $(x, y)$ that satisfy $x^2 + 4y^2 = k$ for any positive integer
$k$ coprime to 6 is bounded by $k^{O(1/\log\log k)}$. The same bound holds for $x^2 + 3y^2 = k$ and
$3x^2 - y^2 = k$, $x > y > 0$.

Proof: Observe that every pair $(x, y)$ that satisfies $x^2 + 4y^2 = k$ induces a unique pair
$(x' = x, y' = 2y)$ that satisfies $x'^2 + y'^2 = k$. Therefore, the number of pairs $(x', y')$ that satisfy the latter
equation is no less than that of the former. It is known that, for any odd integer $k$, there are

$$4 \sum_{d \mid k} (-1)^{(d-1)/2} \qquad (2)$$

integer pairs $(x', y')$ that satisfy $x'^2 + y'^2 = k$. Since the number of divisors of an integer $k$ is
no more than $k^{O(1/\log\log k)}$ due to Wigert [9], an upper bound for (2) is $k^{O(1/\log\log k)}$. Similarly,
it is known that for any odd integer $k$ the number of integer pairs $(x, y)$ that satisfy $x^2 + 3y^2 = k$
is given by a sum, over the divisors of $k$, of Jacobi symbols (3) [18]. Because
each Jacobi symbol has value no more than 1, an upper bound for (3) is $k^{O(1/\log\log k)}$ as desired.

We argue that, for any integer $k$ coprime to 6, the number of integer pairs $(x, y)$ that satisfy
the equation $3x^2 - y^2 = k$, $x > y > 0$ has the same bound. We first give a proof for the case that $x, y, k$
are mutually relatively prime and then relax the restriction.

Let $k = p_1^{r_1} p_2^{r_2} \cdots p_t^{r_t}$ where the $p_i$’s are distinct primes more than 3 and the $r_i$’s are positive
integers. Observe that every integer pair $(x, y)$ that satisfies $3x^2 - y^2 = k$, $x > y > 0$ has the property
that $x, y < \sqrt{k}$. Therefore, every integer pair $(x, y)$ that satisfies $3x^2 - y^2 = k$, $x > y > 0$ induces a
unique pair $(x' = x \bmod k,\ y' = y \bmod k)$ that satisfies $3x'^2 - y'^2 \equiv 0 \pmod{k}$ as well as induces a
pair $(x' = x \bmod p_i^{r_i},\ y' = y \bmod p_i^{r_i})$ that satisfies $3x'^2 - y'^2 \equiv 0 \pmod{p_i^{r_i}}$.

We claim that each such integer pair $(x, y)$ that satisfies $3x^2 - y^2 \equiv 0 \pmod{k}$ has a unique product
$(y x^{-1} \bmod k)$, where the inverse $x^{-1}$ exists since $x$ and $k$ are relatively prime. We give a proof by
contradiction. Suppose $(x_1, y_1)$ and $(x_2, y_2)$ yield the same product $(y x^{-1} \bmod k)$; then $y_1 x_2 \equiv y_2 x_1
\pmod{k}$ or, equivalently, $y_1 x_2 = y_2 x_1$ due to $x_1, y_1, x_2, y_2 < \sqrt{k}$. Since $x_1$ and $y_1$ are relatively
prime, and $x_2$ and $y_2$ are relatively prime, then $x_1 = x_2$, $y_1 = y_2$, a contradiction.

We show that the number of distinct products $(y x^{-1} \bmod k)$ is at most $2^t$. Since $(x' = x \bmod p_i^{r_i},\
y' = y \bmod p_i^{r_i})$ satisfies $3x'^2 - y'^2 \equiv 0 \pmod{p_i^{r_i}}$, $(a_i = y' x'^{-1} \bmod p_i^{r_i})$ is a square
root of 3 modulo $p_i^{r_i}$. There are at most two distinct square roots of 3 for each modulus $p_i^{r_i}$,
$p_i > 3$ [21, Theorem 5.2]. By the Chinese Remainder Theorem, $(a_1, a_2, \ldots, a_t)$ is in a one-to-one
correspondence to $(y x^{-1} \bmod k)$. Hence, there are at most $2^t$ distinct products $(y x^{-1} \bmod k)$ as
desired.

Consequently, the number of integer pairs $(x, y)$ that satisfy $3x^2 - y^2 = k$, $x > y > 0$ for any
integer $k$ coprime to 6 is bounded by $2^t = k^{O(1/\log\log k)}$ when $x, y, k$ are mutually relatively prime.

For the case that two of $x, y, k$ have a common divisor $d > 1$, the third one also has the divisor
$d$. Then, one can divide $x, y, k$ by the common divisor $d$, thus reducing to a case of $x, y, k'$ being
mutually relatively prime for $k' < k$. There are $k^{O(1/\log\log k)}$ such smaller $k'$ and each smaller $k'$
contributes at most $k^{O(1/\log\log k)}$ pairs $(x, y)$. We are done. ■

Lemma 3.6 Given a function $f(x, y) = ax^2 + by^2$ for $a > 0$, $b > 0$, $N_{f(x,y)}(k)$ for all $k \in [1, n]$ can
be computed in $O(M(n \log n))$ time.

Proof: Any positive integer pair $(x, y)$ that satisfies $f(x, y) = k$ has the property that $ax^2, by^2 < k$.
We claim that a long multiplication on a pair of $O(n \log n)$-bit integers suffices to compute $N_{f(x,y)}(k)$
for all $k \in [1, n]$.

For $i \in [1, n]$, let $\alpha_i = 1$ if some $ax^2 = i$, or otherwise $\alpha_i = 0$. Similarly, for $j \in [1, n]$, let $\beta_j = 1$
if some $by^2 = j$, or otherwise $\beta_j = 0$. Then, the following product of polynomials

$$\left(\sum_{i=1}^{n} \alpha_i z^i\right) \left(\sum_{j=1}^{n} \beta_j z^j\right)$$

has the property that the coefficient of $z^k$ modulo 2 is equal to $N_{f(x,y)}(k)$. One can use a
multiplication to replace the product of polynomials by replacing $z$ with an integer base $B$. To avoid carry
issues, we choose $B$ to have $O(\log n)$ bits, because the coefficient of $z^k$ is bounded by a polynomial in $n$. Thus,
the running time is $O(M(n \log n))$. ■
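The reduction from a polynomial product to a single long integer multiplication can be sketched with Python's arbitrary-precision integers. This is an illustrative sketch only: the slot width `bits` is one convenient choice (an assumption, not the paper's constant), and `representation_parities` is a hypothetical name.

```python
def representation_parities(a, b, n):
    """Return a list whose k-th entry is N_{a x^2 + b y^2}(k) for 0 <= k <= n.

    The two indicator polynomials are packed into integers by substituting
    z = 2**bits, and multiplied with a single big-integer product.
    """
    # Each coefficient of the product is at most sqrt(n) < 2**bits, so no
    # coefficient can carry into the neighbouring slot.
    bits = n.bit_length() + 1
    packed_a = 0
    x = 1
    while a * x * x <= n:
        packed_a += 1 << (bits * a * x * x)
        x += 1
    packed_b = 0
    y = 1
    while b * y * y <= n:
        packed_b += 1 << (bits * b * y * y)
        y += 1
    product = packed_a * packed_b  # the one long multiplication
    mask = (1 << bits) - 1
    return [((product >> (bits * k)) & mask) % 2 for k in range(n + 1)]
```

With `a = 1, b = 4`, entry 5 is 1 (from $1^2 + 4 \cdot 1^2$) while entry 65 is 0, since 65 has two representations.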

Corollary 3.7 Given functions $f(x, y) = x^2 + 4y^2$, $g(x, y) = x^2 + 3y^2$, $N_{f(x,y)}(k)$ and $N_{g(x,y)}(k)$
for all $k \in [1, n]$ can be computed in $O(M(n \log n / \log\log n))$ time.

Proof: We use the algorithm stated in Lemma 3.6 but, due to Lemma 3.5, we can choose $B$ to
have $\Theta(\log n / \log\log n)$ bits rather than $\Theta(\log n)$ bits. One needs to avoid the computation of $N_{f(x,y)}(k)$ for
$k$ not coprime to 6 because the number of pairs with $f(x, y) = k$ might require more than $\Theta(\log n / \log\log n)$ bits for such $k$.
We avoid the computation of $N_{f(x,y)}(k)$ for such $k$ by classifying $x^2, 4y^2, 3y^2$ into groups according
to their residue modulo 6. Then, we multiply these groups in pairs only if their sum is coprime to
6, which amplifies the complexity by a constant factor. ■

Lemma 3.8 Given a function $f(x, y) = 3x^2 - y^2$, $N'_{f(x,y)}(k)$ for all $k \in [1, n]$ can be computed in
$O(M(n \log n))$ time.

Proof: Any positive integer pair $(x, y)$ that satisfies $f(x, y) = k$ and $x > y$ has the property that
$x, y < \sqrt{k}$. We claim that $\log n$ multiplications suffice to compute $N'_{f(x,y)}(k)$ for all $k \in [1, n]$.

We relax the condition $x > y$ by divide and conquer and then process each subproblem as in
Lemma 3.6. We reduce the range of pairs $(x, y)$, $0 < y < x < n^{1/2}$, to the following three cases; let
$h = n^{1/2}/2$: (1) $x \in [h, n^{1/2}]$ and $y \in [0, h)$, (2) $0 < y < x < h$, (3) $h < y < x < n^{1/2}$.

Note that case (1) can be computed by the product of $n$-term polynomials as was done in
Corollary 3.7, due to Lemma 3.5. Therefore, case (1) can be done in $O(M(n \log n / \log\log n))$ time.
Besides, the number of pairs $(x, y)$ in cases (2) and (3) is half of that in the original case. To match
the claimed complexity, we recurse for $\log\log n$ levels, with a running time of $O(M(n \log n))$, and
generate $O(\log n)$ lists of pairs $(x, y)$ sorted in ascending $f(x, y)$, and we use the first algorithm in
Lemma 3.3 to merge them into a sorted list $L_1$ in $O(n \log n)$ time. Note that, by the first algorithm,
any pair of duplicated integers is discarded, since we only care about parity. After the recursion,
the number of unprocessed pairs $(x, y)$ is $O(n / \log n)$. We merge the unprocessed pairs $(x, y)$ into a
single sorted list $L_2$ in ascending $f(x, y)$ by the second algorithm used in Lemma 3.3, which takes
$O(n \log n)$ time. Finally, we are done by merging $L_1$ and $L_2$. ■

Combining Lemmas 3.3 and 3.8 and Corollary 3.7, we can realize the sieve of Atkin with a few
long multiplications and some minor procedures doable in $O(n \log n)$ time. As a result, we have
Theorem 3.9.

Theorem 3.9 The prime table $T_n$ from 2 to $n$ can be computed on the Turing Machine in time

$P(n) = O(M(n \log n))$.


4 Lower Bound 

We present a lower bound for computing the factorial $n!$. We do not restrict which operations can
be used, but we assume that the factorial $n!$ is output by a multiplication. We assume that a
multiplication can only operate on two integers, each of which can be an integer of $o(n \log n)$ bits
or a product computed by a multiplication. Under this assumption, we show that computing the
factorial $n!$ has a lower bound $\Omega(M(n \log^{4/7-\epsilon} n))$ for any constant $\epsilon > 0$.

To show the claimed lower bound, we need some lemmas for $M(n)$ and $M_k(n)$, where $M_k(n)$
denotes the optimal time to multiply $k$ pairs of two $n$-bit integers. There is a subtle difference
between $M_k(n)$ and $kM(n)$. $M_k(n)$ denotes the optimal time to multiply $k$ pairs of integers, possibly
in parallel, because all these integers are given at the beginning; however, $kM(n)$ denotes the
optimal time to multiply $k$ pairs of integers serially, one after another. Hence, $M_k(n) \le kM(n)$.
Lemmas 4.1 and 4.3 are simple facts about the Turing Machine model. Lemma 4.2 is based on the
property of progression-free sets.

Lemma 4.1 $M(a, b) = \Omega(a + b)$ and $M(a, b) = \Omega(M(a))$ if $a \le b$.

Proof: $M(a, b) = \Omega(a + b)$ clearly holds on the Turing Machine model. To compute the product
of two $a$-bit integers, every bit of the integers has to be read. On a Turing Machine, one can read
one character in a step. Since the alphabet has constant size, every character can encode $O(1)$
bits.

We prove $M(a, b) = \Omega(M(a))$ if $a \le b$ by padding zeros. Suppose $M(a, b) = o(M(a))$; then
$b = o(M(a))$. To multiply two $a$-bit integers, one can pad $b - a = o(M(a))$ zeros to one $a$-bit
integer and then multiply. In this way, the total running time is $M(a, b) + o(M(a)) = o(M(a))$,
contradicting the optimality of $M(a)$. ■

Lemma 4.2 The products of independent short multiplications can be computed by a long
multiplication; in particular,

$M_\ell(a) = O(M(ka))$,

where $\log k < a$ and $\ell = k^{1-\epsilon}$ for any constant $\epsilon > 0$.
Proof: We represent a $ka$-bit integer $A$ with a sum of $k$ terms

$A = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \cdots + \alpha_{k-1} x^{k-1}$, where the base $x = 2^a$,

and likewise for $B$. We initialize the $\alpha_i$’s and $\beta_i$’s with zeros. For each short multiplication $u \times v$, we
assign $\alpha_i = u$ and $\beta_i = v$ for some index $i$, preserving the following condition. We require that the
set $S$ of assigned indices be progression-free; that is, for every distinct $i, j, h \in S$, $i + j \ne 2h$. In this way,
if we do the multiplication $C = AB$, then the product of matched $u, v$ is placed at the coefficient
of $x^{2i}$ and the products of mismatched pairs cannot be placed at $x^{2i}$ for any $i \in S$. However, carries can
violate the claim. One can avoid an over-long carry by not assigning even numbers for indices or
not assigning odd numbers for indices, because $\log k < a$.

A progression-free set $S \subseteq \{1, 2, \ldots, k\}$ can have size $k^{1-\epsilon}$ for any constant $\epsilon > 0$ [7, 23],
and there exist efficient algorithms for finding one set of that size [3, 10, 22, 24]. On a TM, one can
use Behrend’s algorithm [3], which relies on finding a hyperball containing sufficiently many lattice
points on it, to find such a set. This can be reduced to multiplications as in Lemma 3.6. By the
pigeonhole principle, at least half the integers in $S$ are even or odd. Therefore, we can multiply $\ell$
pairs of two $a$-bit integers by computing the product of two $ka$-bit integers. ■
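The packing argument can be tried out on a toy instance. The sketch below is a simplification under stated assumptions: instead of the base $2^a$ with the even/odd index trick, it uses wider limbs (`width`) so that carries cannot occur at all, and the progression-free index set is supplied by the caller rather than found by Behrend's construction. The names `is_progression_free` and `batch_multiply` are hypothetical.

```python
def is_progression_free(indices):
    """True iff no two distinct chosen indices average to a third chosen index."""
    chosen = set(indices)
    return not any(
        i != j and (i + j) % 2 == 0 and (i + j) // 2 in chosen
        for i in chosen
        for j in chosen
    )


def batch_multiply(pairs, indices, a):
    """Multiply every (u, v) in pairs, each operand below 2**a, with one long product.

    indices must be distinct, progression-free, and as many as there are pairs.
    """
    assert len(indices) == len(pairs) and is_progression_free(indices)
    # Assumption of this sketch: limbs wide enough to hold a sum of
    # len(pairs) products of two a-bit numbers, so no carry crosses limbs.
    width = 2 * a + len(pairs).bit_length()
    packed_u = sum(u << (width * i) for (u, _), i in zip(pairs, indices))
    packed_v = sum(v << (width * i) for (_, v), i in zip(pairs, indices))
    product = packed_u * packed_v  # the single long multiplication
    mask = (1 << width) - 1
    # The matched product u_i * v_i sits alone in limb 2 * i.
    return [(product >> (width * 2 * i)) & mask for i in indices]
```

With the index set {0, 1, 3, 4}, limbs 0, 2, 6 and 8 each receive exactly one matched product, while the mismatched products land in other limbs and are ignored.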

Lemma 4.3 The products of a long multiplication can be computed by the products of independent
short multiplications; in particular,

$M(k^{1/2} n) = O(M_k(n))$.

Proof: We partition the $k^{1/2} n$-bit integers into $n$-bit chunks. Then, to compute the product of the
integers, we compute the products of pairwise chunks and then sum the products up. There are $k$
pairs of chunks and they have no dependency. That means the products of pairwise chunks can be
computed in parallel, completing the proof. ■

Since we restrict that the factorial $n!$ is output by a multiplication, there must be a multiplication
$a_1 \times b_1 = a_0 = n!$ in every algorithm. Besides, we restrict that only the integers of $o(n \log n)$ bits
and intermediate products can be multiplied. Therefore, $a_1, b_1$ are small integers or the computed
intermediate products. Let $|x|$ denote the number of bits in $x$.

If $|a_i| \ge |a_0|/2$, then $a_i$ has more than $o(n \log n)$ bits. Therefore, $a_i$ is also an intermediate
product, which asserts the existence of a multiplication $a_{i+1} \times b_{i+1} = a_i$. We can repeat this until some
$|a_i| < |a_0|/2$. We define $t$ to be the step where it stops. Therefore, there must be $t$ multiplications,
$a_i \times b_i = a_{i-1}$ for all $i \in [1, t]$, in any algorithm that can compute the factorial. In other words, we
have a lower bound of

$$\sum_{i \le t} M(|a_i|, |b_i|). \qquad (4)$$

W.l.o.g., let $|a_i| \ge |b_i|$ and therefore $|a_i| \ge |a_0|/4$ for all $i \in [1, t]$.

Let us simplify Equation (4) by observing the distribution of the $b_i$’s. Consider that

$$a_t \prod_{i \le t} b_i = a_0 \quad \text{and} \quad \sum_{i \le t} |b_i| \ge |a_0| - |a_t|,$$

then $\mu = (|b_1| + |b_2| + \cdots + |b_t|)/t \ge |a_0|/(2t)$. Furthermore, for any $\gamma \in [1, t]$, if there is no $b_i$ of more
than $\gamma\mu$ bits, then there are $t/(2\gamma)$ $b_i$’s of more than $\mu/2$ bits, which is an extension of Markov’s inequality. We
are ready to show the lower bound in Lemma 4.4.


Lemma 4.4 Computing the factorial $n!$ has a lower bound

$$\Omega\left(M_{t^{1/2-\epsilon}}\left(\frac{1}{t}\, n \log n\right)\right),$$

where $t$ is a parameter to be determined later.

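Proof: By applying the extended Markov inequality to Equation (4), one has the lower bound

$$\sum_{i \le t} M(|a_i|, |b_i|) \ \ge\ \max_{\gamma \in [1,t]} \min\left\{ M(|a_0|/4,\ \gamma\mu),\ \frac{t}{2\gamma}\, M(|a_0|/4,\ \mu/2) \right\},$$

which is, by Lemma 4.1, more than

$$\Omega\left(\max_{\gamma \in [1,t]} \min\left\{ M\left(\frac{\gamma}{2t}\, n \log n\right),\ \frac{t}{2\gamma}\, M\left(\frac{1}{4t}\, n \log n\right) \right\}\right).$$

We convert the two terms to the same form and compare. We apply Lemma 4.2 for the first term
and the mentioned $M_k(n) \le kM(n)$ bound for the second term, thus obtaining

$$\Omega\left(\max_{\gamma \in [1,t]} \min\left\{ M_{(2\gamma)^{1-\epsilon}}\left(\frac{1}{4t}\, n \log n\right),\ M_{t/(2\gamma)}\left(\frac{1}{4t}\, n \log n\right) \right\}\right)$$

for any constant $\epsilon > 0$. Observe that $M_k(a) \le M_\ell(a)$ if $k \le \ell$. As a result, we have the following
lower bound, by choosing $\gamma = t^{1/2+\epsilon}/2$ for any constant $\epsilon > 0$,

$$\Omega\left(M_{t^{1/2-\epsilon}}\left(\frac{1}{4t}\, n \log n\right)\right), \qquad (5)$$

which matches the statement of the lemma up to a constant factor in the argument. ■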


Observe that Lemma 4.4 yields a good lower bound only if $t$ is small. Our strategy is to find
another lower bound which is good when $t$ is large. Then, we can trade off between these lower
bounds. We finalize the proof for the claimed lower bound in Theorem 4.5.


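Theorem 4.5 On a TM, computing the factorial $n!$ has a lower bound

$\Omega(M(n \log^{4/7-\epsilon} n))$ for any constant $\epsilon > 0$.

Proof: By Lemma 4.1, one has

$$\sum_{i \le t} M(|a_i|, |b_i|) \ \ge\ \sum_{i \le t} \Omega(|a_i| + |b_i|) = \Omega(t\, n \log n).$$

Combining the above lower bound and the lower bound shown in Lemma 4.4, we obtain

$$\Omega\left(\min_{t} \max\left\{ M_{t^{1/2-\epsilon}}\left(\frac{1}{t}\, n \log n\right),\ t\, n \log n \right\}\right). \qquad (6)$$

Again, we convert the two terms to the same form and compare. We apply Lemma 4.3 for the first
term and apply the current upper bound of $M(n) \le n \log n \, 2^{O(\log^* n)}$ for the second term. Then,
the lower bound becomes

$$\Omega\left(\min_{t} \max\left\{ M\left(\frac{n \log n}{t^{3/4+\epsilon}}\right),\ \frac{M(tn)}{2^{O(\log^* n)}} \right\}\right). \qquad (7)$$

The optimal bound appears at $t = \log^{4/7-\epsilon} n$ for any constant $\epsilon > 0$, as desired. ■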
Corollary 4.6 On a log-RAM, computing the factorial $n!$ has a lower bound

$\Omega(M(n \log^{4/7-\epsilon} n))$ for any constant $\epsilon > 0$.

Proof: We replace the lower bound of $M(n)$ in Equation (6) with $\Omega(n / \log n)$ and replace the upper
bound of $M(n)$ in Equation (7) with $O(n)$. By similar analysis, we are done. ■
5 Background and Related Work (Cont’d)

5.1 Computing Binomial Coefficients 

Consider, as an example, the computation of the central binomial coefficient $\binom{n}{n/2}$, a simple
algorithm for which is to compute $n!$ and $(n/2)!$ independently and divide $n!$ by the square of $(n/2)!$.
However, $n!$ and $(n/2)!$ each have $O(n \log n)$ bits, which is much more than the $O(n)$ bits that $\binom{n}{n/2}$
has. One can do something clever by cancelling the common factors between the numerator and
denominator. For example, when $n$ is a multiple of 24,

[displayed identity for $\binom{n}{n/2}$ lost in extraction; the surviving fragments are the indices $n/2$, $n/8$, $n/3$, $n/6$, $n/12$, $n/24$ and an exponent $-1$]

where $Q_n$ is the product of positive integers in $\{k \le n \mid \gcd(k, 6) = 1\}$. For $n$ not a multiple of 24,
there are at most $23 = O(1)$ further multiplications needed to compute the value.

This approach reduces the number of multiplications from $n$ to $19n/24$. Some of these
multiplications can be further reduced by a recursive call; however, $Q_n$ requires $\Omega(n)$ multiplications.

One can reduce the number of multiplications required for $Q_n$ by letting $Q_n$ be the product of
integers in $\{k \le n \mid \gcd(k, p) = 1 \text{ for each prime } p \le t\}$, where $t$ is a chosen threshold. By Mertens’
theorem [9], $Q_n$ is a product of $O(n / \log t)$ integers. To make $Q_n$ be a product of $O(n / \log n)$
integers, it is necessary to sieve out the multiples of primes. The running time for this
matches that for computing prime tables if Schönhage’s algorithm is used.

Suppose a smaller $t$ is chosen; this approach needs $\Omega((\log n)\, M(n \log n / \log t))$ time, which is
more than $M(n \log n)$, assuming that the conjecture $M(n) = O(n \log n)$ holds.
A folk algorithm to compute the binomial coefficient more efficiently given $T_n$ is based on
Kummer’s theorem [20], stating that for each prime $p$, the largest natural number $r$ such that $p^r$
divides $\binom{n}{k}$ can be computed in $O(\log n / \log p)$ time. We analyze the complexity of this approach
and show in Section 6 that it is $O(M(n) + P(n))$.


5.2 Computing $n!$

There exist several efficient algorithms to calculate $n!$ [5, 6, 25, 27, 28]. Some [5] focus on reducing
the total number of bits of intermediate products by grouping the $n$ integers into sub-groups, for
example by computing the product of each pair of successive integers, and then each pair of those
products, and so on. The total number of bits of intermediate products is then greatly reduced to
$O(n \log^2 n)$.

Others [6, 25, 27, 28] focus on reducing the amount of shared computation between
multiplications. The idea is to use the observation that $p^n$ can be computed via $O(\log n)$ multiplications,
with intermediate products $p^2, p^4, \ldots, p^n$, instead of by $O(n)$ iterative multiplications by $p$. In
order to use this to compute $n!$, such algorithms decompose $n!$ into prime factors, say $p_1^{r_1} p_2^{r_2} \cdots$,
and achieve their speedups by carefully scheduling multiplications in order to reduce the number
of intermediate products.

Borwein [6] divides the factors $p_i$ into $O(\log n)$ groups $G_1, G_2, \ldots$ where

$G_j = \{p_i \mid \text{the } j\text{-th bit of } r_i \text{ is 1 in base 2}\}$.

Let $\pi_j = \prod_{p \in G_j} p$. Since each factor in the same group $G_j$ has the same exponent, then
$n! = \prod_j \pi_j^{2^{j-1}}$, so one can compute the product $\pi_j$ first and compute its power $\pi_j^{2^{j-1}}$ later. This
greatly reduces the amount of shared computation. Borwein shows that this approach runs in
$O(M(n \log n) \log\log n + P(n))$ time. Note that, as Schönhage pointed out, Borwein did not include
the time to compute the prime table but took the table as given.

Schönhage et al. [25] presented a variation of Borwein’s algorithm by factoring $n!$ as follows:

$$n! = \pi_1(\pi_2(\pi_3(\pi_4 \cdots)^2)^2)^2. \qquad (8)$$

This approach takes advantage of the fact that multiplying before exponentiating is faster than
exponentiating each term in a product independently. This algorithm has run time $O(M(n \log n) +
P(n))$. Schönhage gave an $O(n \log^2 n \log\log n)$ algorithm to compute a prime table. At the time of
publication, this constituted a $\log\log n$ factor improvement over Borwein’s algorithm for computing
$n!$. Given Fürer’s improvement on multiplication, this improvement is down to $2^{O(\log^* n)}$.
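The grouping of primes by the bits of their exponents, and the nested evaluation order, can be sketched as follows. This is an illustration under assumptions: the exponents $r_i$ are obtained with Legendre's formula (a standard identity not spelled out in the text), the prime table is supplied by the caller, bits are indexed from 0 rather than 1, and the names `legendre_exponent` and `factorial_by_bit_groups` are hypothetical.

```python
def legendre_exponent(n, p):
    """Exponent of the prime p in n!, via Legendre's formula (assumed here)."""
    exponent, power = 0, p
    while power <= n:
        exponent += n // power
        power *= p
    return exponent


def factorial_by_bit_groups(n, primes):
    """Compute n! from a table of all primes <= n using the nested squaring form."""
    exponents = {p: legendre_exponent(n, p) for p in primes}
    levels = max(exponents.values(), default=0).bit_length()
    # group_products[j] is the product of the primes whose exponent has bit j set.
    group_products = []
    for j in range(levels):
        product = 1
        for p, r in exponents.items():
            if (r >> j) & 1:
                product *= p
        group_products.append(product)
    # Nested evaluation: multiply first, square afterwards, from the top bit down.
    result = 1
    for product in reversed(group_products):
        result = result * result * product
    return result
```

For $n = 10$ the exponents of 2, 3, 5, 7 are 8, 4, 2, 1, so the groups are {7}, {5}, {3}, {2} and the nested form evaluates $7 \cdot (5 \cdot (3 \cdot 2^2)^2)^2 = 3628800$.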

Using an approach similar to Schönhage’s, Vardi [28] independently gave an algorithm based
on the identity:

$$n! = \binom{n}{n/2} \left( \binom{n/2}{n/4} \left( \binom{n/4}{n/8} \left( \binom{n/8}{n/16} \cdots \right)^2 \right)^2 \right)^2 \quad \text{for } n = 2^k. \qquad (9)$$

One might wonder what the difference is between Equations (8) and (9) at first glance. Note
that $\pi_1 = \binom{n}{n/2}$ if $\binom{n}{n/2}$ is squarefree, and similarly for the other $\pi_j$’s. However, Erdős’ squarefree
conjecture states that $\binom{2n}{n}$ is never squarefree for $n > 4$. This was proved by Granville and
Ramaré [15]. This implies that Schönhage’s algorithm performs fewer multiplications than Vardi’s.
Vardi did not analyze the complexity of his algorithm. We analyze Vardi’s algorithm in Section 6 and
show that it has the same asymptotic complexity as Schönhage’s, that is $O(M(n \log n) + P(n))$, as
long as the binomial coefficients are computed in time $O(M(n) + P(n))$.
However, it is possible that a faster algorithm to compute binomial coefficients exists, one that
does not rely on prime table computation. Therefore, there exists some hope that the second 
term in the time complexity might be removed, even if no faster algorithm is given for prime table 
computation. 


6 Factorials and Binomials 

We analyze the complexity of computing the factorial $n!$ by Vardi’s algorithm [28]. Since Vardi’s
algorithm relies on the computation of central binomial coefficients, we begin by analyzing the
complexity of computing the binomial coefficient $\binom{n}{k}$.

6.1 Computing Binomial Coefficients in $O(M(n) + P(n))$ Time

It is known that binomial coefficients can be efficiently computed by Kummer’s Theorem [20, 28].
However, the exact complexity is not known. Here we give an analysis.

Kummer’s Theorem [20] states that, for any binomial coefficient $\binom{n}{k}$ and any prime $p$, the maximum
integer $r$ such that $p^r$ divides $\binom{n}{k}$ is equal to the number of carries that occur when adding $n - k$ and
$k$ in base $p$. Therefore, the prime factorization $p_1^{r_1} p_2^{r_2} \cdots$ of $\binom{n}{k}$ can be computed by trying
every possible prime from 2 to $n$. Each trial requires $O((\log_p n)\, M(\log n))$ time because division
and modular arithmetic on $O(\log n)$-bit integers require $O(M(\log n))$ time [19]. Hence, the prime
factorization of $\binom{n}{k}$ can be obtained in $O(M(n))$ time due to Lemma 6.1.

Lemma 6.1 Let $p_1^{r_1} p_2^{r_2} \cdots p_t^{r_t}$ be the prime factorization of $\binom{n}{k}$. Then,

$$\sum_{\text{prime } p \le n} \log_p n = O(n / \log n).$$

Proof: By Kummer’s Theorem [20], we have $r_i = O(\log_{p_i} n)$. Since

$$\sum_{\text{prime } p \le n} \log_p n \ \le \sum_{\text{prime } p < \gamma} \log_2 n \ + \sum_{\text{prime } p \in [\gamma, n]} \log_\gamma n,$$

choosing $\gamma$ as $n / \log n$ gives the bound $O(n / \log n)$. ■

Then, multiplying the prime factors pairwise until their product $\binom{n}{k}$ is computed gives the
running time shown in Theorem 6.2.

Theorem 6.2 A binomial coefficient $\binom{n}{k}$ can be computed in $O(M(n))$ time given a prime table
from 2 to $n$.

For Vardi’s algorithm, we only care about central binomial coefficients, but of course these are 
just a special case of this theorem. 
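The Kummer-based computation can be sketched directly. This is a minimal illustration that assumes the prime table is supplied by the caller and multiplies the prime powers one after another rather than pairwise in a product tree; the names `carries_in_base` and `binomial_via_kummer` are hypothetical.

```python
def carries_in_base(p, a, b):
    """Number of carries that occur when adding a and b in base p."""
    carries = carry = 0
    while a or b or carry:
        total = a % p + b % p + carry
        carry = 1 if total >= p else 0
        carries += carry
        a //= p
        b //= p
    return carries


def binomial_via_kummer(n, k, primes):
    """Compute C(n, k) from a table of all primes <= n using Kummer's theorem."""
    result = 1
    for p in primes:
        # The exponent of p in C(n, k) is the number of carries
        # when adding k and n - k in base p.
        result *= p ** carries_in_base(p, k, n - k)
    return result
```

For $\binom{10}{3}$ the carry counts in bases 2, 3, 5, 7 are 3, 1, 1, 0, giving $2^3 \cdot 3 \cdot 5 = 120$.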






6.2 Factorial is in $O(M(n \log n))$

Vardi computes the factorial $n!$ by the identity

$$n! = \left(\frac{n+1}{2}\right)^{r} \binom{n}{n/2}\, (n/2)!\, (n/2)!, \qquad (10)$$

where $r = n \bmod 2$ and $n/2$ denotes integral division, i.e., $n/2 = \lfloor n/2 \rfloor$. Note that there are four
terms on the R.H.S. of the identity and each has $O(n \log n)$ bits. Let $F(n)$ denote the running time
for computing the factorial $n!$. Then, we have the following recurrence relation,

$$F(n) = F(n/2) + O(M(n \log n)) \qquad (11)$$

due to Theorems 3.9 and 6.2. Therefore, we have Theorem 6.3.

Theorem 6.3 The factorial $n!$ can be computed in $O(M(n \log n))$ time.
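The halving recursion behind the recurrence for $F(n)$ can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the first factor for odd $n$ is taken to be $(n+1)/2$, which is one valid way to write the identity, and Python's `math.comb` stands in for the prime-table-based binomial algorithm of Theorem 6.2. The name `factorial_by_halving` is hypothetical.

```python
from math import comb


def factorial_by_halving(n):
    """Compute n! as C(n, n//2) * ((n//2)!)^2, with one extra factor when n is odd."""
    if n < 2:
        return 1
    half = n // 2
    r = n % 2
    half_factorial = factorial_by_halving(half)  # the single recursive call F(n/2)
    # For odd n, C(n, half) * half! * (half + 1)! = n!, and half + 1 = (n + 1) // 2.
    return ((n + 1) // 2) ** r * comb(n, half) * half_factorial * half_factorial
```

Only one recursive call is made per level, on an argument half as large, which mirrors the shape of recurrence (11).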